68 research outputs found
Position-Aware Contrastive Alignment for Referring Image Segmentation
Referring image segmentation aims to segment the target object described by a
given natural language expression. Typically, referring expressions contain
complex relationships between the target and its surrounding objects. The main
challenge of this task is to understand the visual and linguistic content
simultaneously and to find the referred object accurately among all instances
in the image. Currently, the most effective way to solve the above problem is
to obtain aligned multi-modal features by computing the correlation between
visual and linguistic feature modalities under the supervision of the
ground-truth mask. However, existing paradigms have difficulty in thoroughly
understanding visual and linguistic content due to the inability to perceive
information directly about surrounding objects that refer to the target. This
prevents them from learning aligned multi-modal features, which leads to
inaccurate segmentation. To address this issue, we present a position-aware
contrastive alignment network (PCAN) to enhance the alignment of multi-modal
features by guiding the interaction between vision and language through prior
position information. Our PCAN consists of two modules: 1) Position Aware
Module (PAM), which provides position information of all objects related to
natural language descriptions, and 2) Contrastive Language Understanding Module
(CLUM), which enhances multi-modal alignment by comparing the features of the
referred object with those of related objects. Extensive experiments on three
benchmarks demonstrate our PCAN performs favorably against the state-of-the-art
methods. Our code will be made publicly available.
Comment: 12 pages, 6 figure
CCLAP: Controllable Chinese Landscape Painting Generation via Latent Diffusion Model
With the development of deep generative models, recent years have seen great
success of Chinese landscape painting generation. However, few works focus on
controllable Chinese landscape painting generation due to the lack of data and
limited modeling capabilities. In this work, we propose a controllable Chinese
landscape painting generation method named CCLAP, which can generate paintings
with specific content and style based on the Latent Diffusion Model. Specifically,
it consists of two cascaded modules, i.e., a content generator and a style
aggregator. The content generator module ensures that the content of the generated
paintings matches the input text, while the style aggregator module
generates paintings in a style corresponding to a reference image. Moreover, a
new dataset of Chinese landscape paintings named CLAP is collected for
comprehensive evaluation. Both the qualitative and quantitative results
demonstrate that our method achieves state-of-the-art performance, especially
in terms of artful composition and artistic conception. Code is available at
https://github.com/Robin-WZQ/CCLAP.
Comment: 8 pages, 13 figure
Semantic Graph Representation Learning for Handwritten Mathematical Expression Recognition
Handwritten mathematical expression recognition (HMER) has attracted
extensive attention recently. However, current methods cannot explicitly study
the interactions between different symbols, and may therefore fail when faced
with similar symbols. To alleviate this issue, we propose a simple but efficient
method to enhance semantic interaction learning (SIL). Specifically, we first construct
a semantic graph based on the statistical symbol co-occurrence probabilities.
Then we design a semantic aware module (SAM), which projects the visual and
classification feature into semantic space. The cosine distance between
different projected vectors indicates the correlation between symbols.
Jointly optimizing HMER and SIL explicitly enhances the model's
understanding of symbol relationships. In addition, SAM can be easily plugged
into existing attention-based models for HMER and consistently brings
improvement. Extensive experiments on public benchmark datasets demonstrate
that our proposed module can effectively enhance the recognition performance.
Our method achieves better recognition performance than prior art on both
the CROHME and HME100K datasets.
Comment: 12 Page
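The two ingredients named in this abstract, a semantic graph built from symbol co-occurrence statistics and a cosine measure between projected vectors, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the per-expression co-occurrence definition, and the toy symbol sequences are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_graph(expressions):
    """Estimate P(b appears | a appears) per expression (assumed definition)."""
    symbol_counts = Counter()
    pair_counts = Counter()
    for symbols in expressions:
        unique = set(symbols)
        symbol_counts.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    # Edge weight a -> b: fraction of expressions containing a that also contain b.
    return {(a, b): n / symbol_counts[a] for (a, b), n in pair_counts.items()}


def cosine_similarity(u, v):
    """Cosine similarity between two vectors in the (hypothetical) semantic space."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Toy label sequences, invented for this example.
expressions = [["x", "^", "2"], ["x", "+", "y"], ["\\sum", "x", "^", "2"]]
graph = cooccurrence_graph(expressions)
```

In a setup like the one the abstract describes, edge weights of this kind could serve as targets for the similarity between projected symbol vectors; how the paper actually couples the two is not stated here.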
ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation
In addition to their unprecedented ability in imaginative creation, large
text-to-image models are expected to incorporate customized concepts in image
generation. Existing works generally learn such concepts in an
optimization-based manner, which brings an excessive computation or memory burden.
In this paper, we instead propose a learning-based encoder, which consists of
global and local mapping networks for fast and accurate customized
text-to-image generation. Specifically, the global mapping network projects the
hierarchical features of a given image into multiple new words in the textual
word embedding space, i.e., one primary word for well-editable concept and
other auxiliary words to exclude irrelevant disturbances (e.g., background). In
the meantime, a local mapping network injects the encoded patch features into
cross attention layers to provide omitted details, without sacrificing the
editability of primary concepts. We compare our method with existing
optimization-based approaches on a variety of user-defined concepts, and
demonstrate that our method enables high-fidelity inversion and more robust
editability with a significantly faster encoding process. Our code is publicly
available at https://github.com/csyxwei/ELITE.
Comment: Accepted by ICCV 2023, oral presentation. Code:
https://github.com/csyxwei/ELIT
Patch Is Not All You Need
Vision Transformers have achieved great success in computer vision,
delivering exceptional performance across various tasks. However, their
inherent reliance on sequential input enforces the manual partitioning of
images into patch sequences, which disrupts the image's inherent structural and
semantic continuity. To handle this, we propose a novel Pattern Transformer
(Patternformer) to adaptively convert images to pattern sequences for
Transformer input. Specifically, we employ the Convolutional Neural Network to
extract various patterns from the input image, with each channel representing a
unique pattern that is fed into the succeeding Transformer as a visual token.
By enabling the network to optimize these patterns, each pattern concentrates
on its local region of interest, thereby preserving its intrinsic structural
and semantic information. Employing only a vanilla ResNet and Transformer, we
achieve state-of-the-art performance on CIFAR-10 and CIFAR-100 and
competitive results on ImageNet.
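The contrast drawn in this abstract, one token per feature-map channel rather than one token per image patch, can be shown with plain nested lists. This is only a shape-level sketch under assumptions: the function names are hypothetical, the feature map is hand-written rather than produced by a CNN, and the image height and width are assumed to be divisible by the patch size.

```python
def channels_to_tokens(feature_map):
    """Flatten a C x H x W feature map into C tokens, each of length H * W."""
    return [[value for row in channel for value in row] for channel in feature_map]


def patches_to_tokens(image, patch):
    """Baseline for contrast: split an H x W image into non-overlapping patch tokens."""
    height, width = len(image), len(image[0])
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


# A toy 3-channel 2x2 feature map standing in for CNN output.
feature_map = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    [[9, 10], [11, 12]],
]
pattern_tokens = channels_to_tokens(feature_map)

# A toy 4x4 single-channel image split into 2x2 patches.
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
patch_tokens = patches_to_tokens(image, 2)
```

Each pattern token spans the whole spatial extent of its channel, whereas each patch token covers only a fixed window of the image.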
Optimization of network structure to random failures
A network's resilience to the malfunction of its components has been of great
concern. The goal of this work is to determine network design guidelines
that maximize the network efficiency while keeping the cost of the network
(that is, the average connectivity) constant. With a global optimization method,
memory tabu search (MTS), we obtain the network structure with
approximately the best efficiency. We analyze the statistical characteristics of the
network and find that a network with a small number of hub nodes and a high degree
of clustering may be much more resilient to perturbations than a random network,
and that the optimal network is a kind of highly heterogeneous network. The
results strongly suggest that networks with higher efficiency are more robust
to random failures. In addition, we propose a simple model to describe the
statistical properties of the optimal network and investigate the
synchronizability of this model.
Comment: 11 pages, 6 figures, accepted by Physica
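The abstract does not define "network efficiency". A common choice, assumed here, is global efficiency: the mean of inverse shortest-path lengths over all ordered node pairs, with unreachable pairs contributing zero. The sketch below computes it by breadth-first search and simulates random node failures. The function names and the star-graph example are illustrative assumptions, and the memory tabu search itself is not reproduced.

```python
import random
from collections import deque


def global_efficiency(adjacency):
    """Mean of 1/d(i, j) over ordered pairs i != j; unreachable pairs add 0."""
    nodes = list(adjacency)
    n = len(nodes)
    if n < 2:
        return 0.0
    total = 0.0
    for source in nodes:
        # Breadth-first search gives hop distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1.0 / d for node, d in dist.items() if node != source)
    return total / (n * (n - 1))


def remove_random_nodes(adjacency, count, rng):
    """Return a copy of the graph with `count` randomly chosen nodes removed."""
    failed = set(rng.sample(sorted(adjacency), count))
    return {u: {v for v in neighbors if v not in failed}
            for u, neighbors in adjacency.items() if u not in failed}


# A 5-node star: node 0 is the hub, nodes 1 to 4 are leaves.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
star_efficiency = global_efficiency(star)
damaged = remove_random_nodes(star, 1, random.Random(0))
```

Comparing `global_efficiency` before and after `remove_random_nodes`, averaged over many random draws, is one simple way to quantify robustness to random failures in the sense the abstract discusses.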
Local Global Relational Network for Facial Action Units Recognition
Many existing facial action unit (AU) recognition approaches enhance the AU representation by combining local features from multiple independent branches, each corresponding to a different AU. However, such multi-branch combination-based methods usually neglect the potential mutual assistance and exclusion relationships between AU branches, or simply employ a pre-defined and fixed knowledge graph as a prior. In addition, extracting features from pre-defined AU regions of regular shapes limits the representation ability. In this paper, we propose a novel Local Global Relational Network (LGRNet) for facial AU recognition. LGRNet mainly consists of two novel structures: a skip-BiLSTM module, which models the latent mutual assistance and exclusion relationships among local AU features from multiple branches to enhance feature robustness, and a feature fusion & refining module, which explores the complementarity between local AUs and the whole face in order to refine the local AU features and improve their discriminability. Experiments on the BP4D and DISFA AU datasets show that the proposed approach outperforms the state-of-the-art methods by a large margin.
Experimental Investigation on the DLC Film Coating Technology in Scroll Compressors of Automobile Air Conditioning
The friction of the orbiting scroll leads to large power consumption and low energy efficiency of the scroll compressor. The common methods to solve this problem involve high cost and complex processes. Applying coating technology to the scroll compressor, given its special structure and operating principles, is a new subject. Because the orbiting scroll is made of aluminum alloy, unbalanced magnetron sputtering was chosen for the orbiting scroll of the scroll compressor, and a Cr transition layer was coated to enhance the bonding strength. Moreover, we performed an experiment to verify the feasibility of unbalanced magnetron sputtering for coating diamond-like carbon film in the scroll compressor. This article describes the test methods for the film properties before and after the experiments and the components of the experimental system. The results showed that the diamond-like carbon film has a low friction coefficient and high bonding strength, which give it a good wear-reducing effect and an excellent self-lubricating property. Due to the thin film layer and high operating temperature, the thickness should be increased to raise the abrasion resistance. The refrigeration system with the scroll compressor coated with the diamond-like carbon film can satisfy the national standard conditions with low Vickers hardness. Its performance was improved at low speed. Therefore, unbalanced magnetron sputtering with an increased Cr bond layer is a feasible and appropriate technology for coating diamond-like carbon film.